The penguin dataset is composed of 344 observations with 8 variables, 5 of which are numeric and 3 which are qualitative. The dataset is mostly complete with just a few observations with missing values that will need to be handled.
| Name | data |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
The target variable of interest is the species of penguins, which are categorized into three groups: Adelie, Gentoo and Chinstrap penguins.
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
From this plot, we can make a few key observations:
These island observations are valuable information in differentiating penguin species.
However, the sex of the penguins does not offer much information as the proportion is about even across all species. We can also note a few missing observations labeled as NA.
We noted from the data summary above that 11 observations were missing for the sex variable. There is also no reason to believe that the year the observation was taken would have any impact on the morphology of the penguins. We are not looking for any time series modeling. Therefore, we also drop year from our predictor variables. There are also two observations which are missing body measurements altogether, so these rows will be dropped altogether.
When looking at body measurements we see that Adelie and Chinstrap penguins largely overlap except for bill_length. This suggests that we might be able to use bill_depth, body_mass and flipper_length to differentiate the Gentoo penguins from the other species. However, the Adelie penguin stands out from the other others in bill_length
Using the featurePlot function from the caret package we can easily display data distributions such as the scatter plot matrix similar to the one above.
We see on the univariate feature plots below that the data is aproximatelly normally distibuted. This is an important assumption of LDA and QDA.
Taking a look at the correlation matrix below, we can make a few observations, notably that flipper_length is highly positively correlated with body_mass which makes sense given that larger penguins should have larger flippers. The other correlations are less obvious to interpret. Given that the dataset only contains a few predictors, we choose not to exclude any variables based on multicollinearity at this time.
The data is split into training and testing sets 80%/20%. The test set contains 65 observations.
set.seed(622)
trainIndex <- createDataPartition(data$species, p = .8, list = FALSE, times = 1)
training <- data[ trainIndex,]
testing <- data[-trainIndex,]
dim(testing)## [1] 65 6
LDA assumes that the predictors are distributed as multivariate gaussian with common covariance. Based on the distribution plot above, LDA seems to be a good fit.
By the proportion of trace, we can see that the first linear discriminant LD1 achieves 87.7% separation and the second, LDA 12.3%.
## Call:
## lda(species ~ ., data = training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sexmale
## Adelie 38.76325 18.37521 189.4786 3688.675 0.4957265
## Chinstrap 48.84545 18.44182 196.0364 3757.727 0.4727273
## Gentoo 47.73125 15.03125 217.5208 5104.948 0.5104167
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.132134680 -0.413367283
## bill_depth_mm -0.937321597 -0.203363198
## flipper_length_mm 0.088727372 0.011625232
## body_mass_g 0.001439478 0.001402053
## sexmale -0.600470011 0.867736702
##
## Proportion of trace:
## LD1 LD2
## 0.8724 0.1276
The LDA model performed very well with 100% accuracy on the test set. Another iteration of this model that omitted the sex variable yielded worst performance with 96.9% accuracy and 2 misclassified penguins. This shows that sex has some predictive power and should be included in the model.
scores <- data.frame()
cm.lda <- confusionMatrix(lda.pred$class, testing$species)
lda.acc <- cm.lda[[3]][1]
scores <-rbind(scores, data.frame(model="LDA", accuracy=lda.acc))
cm.lda## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9448, 1)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0 1.0000
## Specificity 1.0000 1.0 1.0000
## Pos Pred Value 1.0000 1.0 1.0000
## Neg Pred Value 1.0000 1.0 1.0000
## Prevalence 0.4462 0.2 0.3538
## Detection Rate 0.4462 0.2 0.3538
## Detection Prevalence 0.4462 0.2 0.3538
## Balanced Accuracy 1.0000 1.0 1.0000
The partition plot below helps to visualize LDA. The observations that end up misclassified are shown in red. In the case of LDA, the decision boundary is linear.
Unlike LDA, QDA assumes that each class has its own covariance matrix. As a result, the decision boundary is quadratic.
## Call:
## qda(species ~ ., data = training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sexmale
## Adelie 38.76325 18.37521 189.4786 3688.675 0.4957265
## Chinstrap 48.84545 18.44182 196.0364 3757.727 0.4727273
## Gentoo 47.73125 15.03125 217.5208 5104.948 0.5104167
The QDA model performed even better with 98.5% accuracy and only 1 observation misclassified. The QDA model performs the same when the categorical variable sex was omitted.
qda.pred <- predict(qda.fit, testing)
cm.qda <- confusionMatrix(qda.pred$class, testing$species)
qda.acc <- cm.qda[[3]][1]
scores <-rbind(scores, data.frame(model="QDA", accuracy=qda.acc))
cm.qda## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 0 0
## Chinstrap 1 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 1.0000 1.0000
## Specificity 1.0000 0.9808 1.0000
## Pos Pred Value 1.0000 0.9286 1.0000
## Neg Pred Value 0.9730 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.2000 0.3538
## Detection Prevalence 0.4308 0.2154 0.3538
## Balanced Accuracy 0.9828 0.9904 1.0000
Unlike the LDA partition plot, we see below that the decision boundary is quadratic. We can note in particular that the curvature of the pink region in the sixth subplot helps limit the number of Chinstrap penguins classified as Gentoo better than LDA did for the same plot. In this case, it seems that even with QDA most decision boundaries remain nearly linear.
The Naive Bayes classifier assumes that all features are equally important and independent which is often not the case and may result in some bias. However, the assumption of independence simplifies the comptutations by turning conditional probabilities into products of probabilities. Here we do not need to determine the exact posterior probability but simply which class is more likely.
The classification result for Naive Bayes on the test set yielded 96.7% accuracy and two misclassified penguins.
features <- setdiff(names(training), "species")
x <- training[,features]
y <- training$species
model.naive <- naiveBayes(x = x,y = y, laplace = 1)
result.naive <- predict(model.naive, testing %>% dplyr::select(-species))
# Make confusion matrix
cm.naive <- confusionMatrix(result.naive, testing$species)
nb.acc <- cm.naive[[3]][1]
scores <-rbind(scores, data.frame(model="NB", accuracy=nb.acc))
cm.naive## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9516
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 1.0000
## Specificity 0.9722 0.9808 1.0000
## Pos Pred Value 0.9655 0.9231 1.0000
## Neg Pred Value 0.9722 0.9808 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 1.0000
The data frame below summarizes the model accuracies from above. LDA was the best performing model on the test set. However, the test set only comprised of 65 observations. We can expand our model evaluation on the test with alternate train-test data splits.
The function below is created to test our models from above on more observations. By iterating over 100 train/test splits of the data, we can evaluate the performance of the models on more unseen data. Note that new models are fitted to each new split of the data.
set.seed(25)
sim_test_set <- function() {
sim_scores <- data.frame()
features <- setdiff(names(training), "species")
for (i in seq(1:100)) {
trainIndex <- createDataPartition(data$species, p = .8, list = FALSE, times = 1)
training <- data[ trainIndex,]
testing <- data[-trainIndex,]
# LDA
lda.fit <- lda(species ~ ., data=training)
lda.pred <- predict(lda.fit, testing)
cm.lda <- confusionMatrix(lda.pred$class, testing$species)
lda.acc <- cm.lda[[3]][1]
# QDA
qda.fit <- qda(species ~ ., data=training)
qda.pred <- predict(qda.fit, testing)
cm.qda <- confusionMatrix(qda.pred$class, testing$species)
qda.acc <- cm.qda[[3]][1]
# Naive Bayes
x <- training[,features]
y <- training$species
model.naive <- naiveBayes(x = x,y = y, laplace = 1)
result.naive <- predict(model.naive, testing %>% dplyr::select(-species))
cm.naive <- confusionMatrix(result.naive, testing$species)
nb.acc <- cm.naive[[3]][1]
sim_scores <-rbind(sim_scores, data.frame(lda=lda.acc, qda=qda.acc, nb=nb.acc))
}
return(sim_scores)
}The performance is summarized in the boxplot above. While the performance of the LDA models was the best over the original test set, we see here that over 100 data splits, LDA and QDA models are nearly identical. The Naive Bayes model tend to have lower accuracy and a longer lower tail than LDA.
In general LDA models are more stable than logistic regression when the classes are well separated and the features are approximately normal, which is the case here. However, by assuming that the predictors are distributed as multivariate gaussian with common covariance we are limiting the flexibility of the fit. When comparing partition plots between LDA and QDA we saw that most QDA boundaries were nearly linear showing that the LDA assumptions nearly hold. The correlation between bill depth and flipper length as well as body weight and flipper length was an early indication that QDA might be a better fit. However, QDA provided no meaningful gain in performance over LDA. Therefore we prefer the LDA model which is more parsimious since it has fewer covariance parameters to estaimte.
While the Naive Bayes model is easy to implement it likely suffers in our case from the assumption of feature independence. As stated above, some of the predictor variables were correlated, thereby violating the assumption and introducing bias.